In [1]:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
In [2]:
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
"Features" are also known as predictors, inputs or attributes. The "reponse" is also known as the target, label or output.
In [3]:
# check the shapes of X and y
print(X.shape)
print(y.shape)
"Observations" are also known as samples, instances, or records.
In [5]:
# examine the first 5 rows of the feature matrix
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()
Out[5]:
In [6]:
y[:5]
Out[6]:
In [8]:
pd.Series(y).value_counts()
Out[8]:
In order to build a model, the features must be numeric, and every observation must have the same features in the same order.
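As a quick sanity check, here is a minimal sketch (reusing the X and iris objects loaded above) that confirms the iris features are numeric and that every observation carries the same four features in the same order:
# minimal sanity check, reusing the X and iris objects loaded above
print(X.dtype)              # float64: all features are numeric
print(X.shape)              # (150, 4): every observation has the same 4 features
print(iris.feature_names)   # the fixed feature order shared by all rows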
In [10]:
# import the class
from sklearn.neighbors import KNeighborsClassifier
# instantiate the model (with the default parameters)
knn = KNeighborsClassifier()
# fit the model with data (occurs in-place)
knn.fit(X, y)
Out[10]:
In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.
In [12]:
# predict the response for a new observation
knn.predict([[3, 5, 4, 2]])
Out[12]:
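The prediction comes back as an encoded class label (an integer index into iris.target_names). A minimal sketch of decoding it into a species name, assuming the knn model fitted above:
# decode the predicted class index into a species name (sketch, using the fitted knn above)
pred = knn.predict([[3, 5, 4, 2]])
print(iris.target_names[pred])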
In [13]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
In [14]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(simple_train)
Out[14]:
In [15]:
vect.get_feature_names()
Out[15]:
In [16]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
Out[16]:
In [17]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
Out[17]:
In [18]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
Out[18]:
In [19]:
# examine the sparse matrix contents
print(simple_train_dtm)
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary on the order of 100,000 unique words in total, while each individual document will use only 100 to 1,000 unique words.
To store such a matrix in memory and to speed up operations on it, implementations typically use a sparse representation such as those available in the scipy.sparse package.
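A short sketch of measuring that sparsity for the small document-term matrix built above (assuming simple_train_dtm from the transform step):
# fraction of zero entries in the sparse document-term matrix (sketch)
n_total = simple_train_dtm.shape[0] * simple_train_dtm.shape[1]
sparsity = 1 - simple_train_dtm.nnz / n_total
print('{:.2%} of the entries are zeros'.format(sparsity))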
In [20]:
# example text for model testing
simple_test = ["please don't call me"]
In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.
In [21]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
Out[21]:
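Because the fitted vocabulary is reused, the test document-term matrix has exactly the same columns, in the same order, as the training one; test words that were never seen during fitting are simply dropped. A short check under that assumption:
# the test matrix reuses the fitted training vocabulary, so the column counts match (sketch)
print(simple_train_dtm.shape[1], simple_test_dtm.shape[1])   # same number of features
print(vect.vocabulary_)   # only terms learned from simple_train; unseen test words are ignored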
In [31]:
# read file into pandas using a relative path
# alternative: read file into pandas from a URL
# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
path = 'sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])
In [32]:
# examine the shape
sms.shape
Out[32]:
In [33]:
# examine the first 10 rows
sms.head(10)
Out[33]:
In [34]:
# examine the class distribution
sms.label.value_counts()
Out[34]:
In [36]:
# examine the class distribution as percentages
sms.label.value_counts() * 100 / sms.shape[0]
Out[36]:
In [38]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({
'ham' : 0,
'spam': 1
})
In [39]:
# check that the conversion worked
sms.head(10)
Out[39]:
In [40]:
# define X and y from the SMS data for use with CountVectorizer
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)
In [42]:
# split X, y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1087)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
In [43]:
# instantiate the vectorizer
vect = CountVectorizer()
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
# examine the document-term matrix
X_train_dtm
Out[43]:
In [44]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
Out[44]:
In [45]:
# import and instantiate a multinomial naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
In [46]:
# train the model using X_train_dtm (timing it)
%time nb.fit(X_train_dtm, y_train)
Out[46]:
In [47]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
In [48]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
Out[48]:
In [49]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
Out[49]:
In [50]:
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]
Out[50]:
In [51]:
# print message text for false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]
Out[51]:
In [52]:
# example false negative
X_test[5037]
Out[52]:
In [53]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
Out[53]:
In [54]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
Out[54]:
In [55]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
lreg = LogisticRegression()
In [57]:
# train the model using X_train_dtm
%time lreg.fit(X_train_dtm, y_train)
Out[57]:
In [59]:
# make class predictions using X_test_dtm
y_pred_class = lreg.predict(X_test_dtm)
In [60]:
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = lreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
Out[60]:
In [61]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
Out[61]:
In [62]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
Out[62]:
In [ ]: